Information Extraction from Patients' Free Form Documentation

Authors

  • Agnieszka Mykowiecka
  • Malgorzata Marciniak
Abstract

The paper presents two rule-based information extraction (IE) experiments on two types of patients' documentation in Polish. For both document types, values of sets of attributes were assigned using specially designed grammars.

1 Method/General Assumptions

Various rule-based, statistical, and machine learning methods have been developed for the purpose of information extraction. Unfortunately, they have rarely been tested on Polish texts, whose rich inflectional morphology and relatively free word order are challenging. Here, we present the results of two experiments aimed at extracting information from mammography reports and hospital records of diabetic patients. Since there are no annotated corpora of Polish medical text that can be used in supervised statistical methods, and we do not have enough data for weakly supervised methods, we chose the rule-based extraction schema.

This work was partially financed by the Polish national project number 3 T11C 007 27.

The processing procedure in both experiments consisted of four stages: text preprocessing, application of IE rules based on morphological information and domain lexicons, postprocessing (data cleaning and structuring), and conversion into a relational database. Preprocessing included format unification, data anonymization, and (for mammography reports) automatic spelling correction.

The extraction rules were defined as grammars of the SProUT system (Drożdżyński et al., 2004). SProUT consists of a set of processing components for basic linguistic operations, including tokenization, sentence splitting, morphological analysis (for Polish we use Morfeusz (Woliński, 2006)), and gazetteer lookup. The SProUT components are combined into a pipeline that generates typed feature structures (TFS), on which rules in the form of regular expressions with unification can operate. Small specialized lexicons containing both morphological and semantic (concept name) information have been created for both document types.
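The idea of rules operating on lexicon-annotated tokens can be illustrated with a minimal Python sketch. It is not SProUT and does not use Morfeusz: the lexicon entries, the concept names (`NODULE`, `CYST`), the example sentences, and the single "concept followed by a size in mm" rule are all invented for illustration, and plain dictionaries stand in for typed feature structures.

```python
import re

# Hypothetical mini-lexicon: inflected Polish form -> (lemma, concept name).
# A real system would obtain lemmas from a morphological analyser instead.
LEXICON = {
    "guzek": ("guzek", "NODULE"),
    "guzka": ("guzek", "NODULE"),
    "guzkiem": ("guzek", "NODULE"),
    "torbiel": ("torbiel", "CYST"),
    "torbieli": ("torbiel", "CYST"),
}

NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)?")
TOKEN_RE = re.compile(r"\d+(?:[.,]\d+)?|\w+")


def tokenize(text):
    """Split lower-cased text into word and number tokens."""
    return TOKEN_RE.findall(text.lower())


def annotate(tokens):
    """Attach a small feature dictionary to each token (a stand-in for a TFS)."""
    annotated = []
    for tok in tokens:
        feats = {"surface": tok}
        if tok in LEXICON:
            lemma, concept = LEXICON[tok]
            feats.update(lemma=lemma, concept=concept)
        elif NUMBER_RE.fullmatch(tok):
            feats["number"] = float(tok.replace(",", "."))
        annotated.append(feats)
    return annotated


def extract_findings(text, window=5):
    """Rule: a concept token followed within `window` tokens by NUMBER + 'mm'."""
    toks = annotate(tokenize(text))
    findings = []
    for i, tok in enumerate(toks):
        if "concept" not in tok:
            continue
        finding = {"concept": tok["concept"], "size_mm": None}
        # Stop one token early so that toks[j + 1] is always a valid index.
        for j in range(i + 1, min(i + 1 + window, len(toks) - 1)):
            if "number" in toks[j] and toks[j + 1]["surface"] == "mm":
                finding["size_mm"] = toks[j]["number"]
                break
        findings.append(finding)
    return findings
```

Calling `extract_findings("W lewej piersi guzek o średnicy 15 mm.")` yields one finding whose concept is `NODULE` with a size of 15.0 mm. Because the lexicon maps every inflected form to the same concept, the rule itself does not need to enumerate case endings, which is one way to cope with rich inflection.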
Extracted attribute values are stored in a relational database. Before that, the results for mammography reports undergo additional postprocessing: the extracted data are grouped together. Specially designed scripts insert boundaries that separate descriptions of anatomical changes, tissue structure, and diagnosis. More details about the mammography IE system can be found in (Mykowiecka et al., 2005).
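The grouping and storage step described above might look roughly like the following Python sketch. It assumes that a new group begins whenever one of a few "starter" attributes appears; the attribute names, the table layout, and the use of SQLite are illustrative assumptions, not the authors' actual scripts or database schema.

```python
import sqlite3

# Hypothetical attribute names that open a new group of extracted values.
GROUP_STARTERS = {"change_type", "tissue_structure", "diagnosis"}


def group_attributes(pairs):
    """Split a flat list of (attribute, value) pairs into groups.

    A new group starts at every starter attribute, and at the first pair.
    """
    groups = []
    for attribute, value in pairs:
        if attribute in GROUP_STARTERS or not groups:
            groups.append([])
        groups[-1].append((attribute, value))
    return groups


def store_report(conn, report_id, pairs):
    """Group the pairs and insert them into a relational table; return row count."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS extracted_value (
               report_id INTEGER NOT NULL,
               group_no  INTEGER NOT NULL,
               attribute TEXT NOT NULL,
               value     TEXT NOT NULL
           )"""
    )
    rows = [
        (report_id, group_no, attribute, value)
        for group_no, group in enumerate(group_attributes(pairs))
        for attribute, value in group
    ]
    conn.executemany("INSERT INTO extracted_value VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)
```

With an in-memory database (`sqlite3.connect(":memory:")`), storing the pairs `change_type=nodule`, `size_mm=15`, `change_type=cyst`, `diagnosis=benign` produces four rows in three groups, so the size stays attached to the nodule rather than to the cyst.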




Publication date: 2007